Abstract

This paper deals with clustering methods based on adaptive distances for histogram data using a
dynamic clustering algorithm. Histogram data describe individuals in terms of empirical
distributions. This kind of data can be considered as a complex description of phenomena observed
on complex objects: images, groups of individuals, spatially or temporally varying data, results of
queries, environmental data, and so on. The Wasserstein distance is used to compare two
histograms. The Wasserstein distance between histograms consists of two components: the
first is based on the means, and the second on the internal dispersions (standard deviation,
skewness, kurtosis, and so on) of the histograms.

To cluster sets of histogram data, we propose to use the Dynamic Clustering Algorithm (based on
adaptive squared Wasserstein distances), which is a k-means-like algorithm for clustering a set of
individuals into K classes, where K is fixed a priori. The main aim of this research is to provide a
tool for clustering histograms that emphasizes the different contributions of the histogram
variables, and of their components, to the definition of the clusters. We demonstrate that this can
be achieved using adaptive distances.

Two kinds of adaptive distances are considered: the first takes into account the variability of each
component of each descriptor for the whole set of individuals; the second takes into account the
variability of each component of each descriptor in each cluster. We provide tools for interpreting
the obtained partition, based on an extension of the classical measures (indexes) to the use of
adaptive distances in the clustering criterion function. Applications on synthetic and real-world
data corroborate the proposed procedure.
1. Introduction

In many real applications, data are collected and/or represented by histograms that describe
the empirical distributions of a phenomenon. In computer vision, the characteristics
of images are usually represented as histograms of different masses. Other fields of application
also use histogram descriptions: for privacy-preserving purposes, data about a phenomenon (for
example, the flows of a bank account) can be summarized by histograms, as in the dissemination
of official statistics. Cluster analysis aims to group a set of objects into a number of homogeneous
clusters according to the values they assume with respect to a set of observed variables.

In this paper we deal with a clustering procedure that partitions a set of histogram data into
a predefined number of clusters. Histogram data were introduced in the context of Symbolic
Data Analysis by Bock and Diday. They are defined by a set of contiguous intervals of the
real domain, which represent the support of each histogram, together with an associated system
of weights (frequencies, densities).

Symbolic Data Analysis (SDA) is a domain in the area of knowledge discovery related to 
multivariate analysis, pattern recognition and artificial intelligence, aiming to provide suitable 
methods (clustering, factorial techniques, decision trees, etc.) for managing aggregated data de- 
scribed by multi-valued variables, i.e., where the cells of the data table contain sets of categories, 
intervals, or weight (probability) distributions (for further insights about the SDA approach, see 
^, 11 and II). 

Several proposals have been presented in the literature for clustering histogram data (see 
ll3],llll, S], ll3l)- Dynamic Clustering (DC) (Si,!!]) is proposed as a suitable method to partition 
a set of data represented by frequency distributions. We recall that DC needs to define a proximity 
function, to assign the individuals to the clusters, and to choose a suitable way to represent the 
clusters by an element which optimizes a criterion function. Further, the representative element 
of a cluster, called prototype, has to be consistent with the description of the clustered elements: 
i.e., if data to be clustered are distributions, the prototype must be also a distribution. 

The DC method Jit] is a general partitioning algorithm of a set of objects in K clusters. 
It looks for the solution by optimizing a criterion of best fitting between the partition and the 
representation of the clusters of such partition. In DC, the choice of a suitable dissimilarity plays
a central role in the definition of the allocation and representation phases. The k-means algorithm
is a particular case of DC in which the squared Euclidean distance is used in the criterion function.
Depending on the nature of the data and on the chosen dissimilarity function, DC is a more general
schema of partitioning around a set of prototypes. In the case of the k-means algorithm, the
prototypes are the means of each cluster. Depending on the optimized criterion, prototypes can also
be regression lines, factorial axes, etc.

The comparison of histogram data can be seen as a particular case of the comparison of distribution
functions. Several distances and dissimilarities have been presented in the literature; some of
these, used for histogram data, are presented in [7]. Another good review of distances between
distributions can be found in [10]. In the special field of computer vision, the Earth Mover's
distance (EMD) was introduced for color and texture images. This distance can be applied
to distributions of points. It is worth noting that the EMD for histograms of pixel intensities is
equivalent to the Mallows, or Wasserstein, distance on probability distributions ( 12|, ifisll . The 
family of distances based on the Wasserstein metric yields interesting interpretative results
based on the characteristics or on the moments of the compared distributions (see [7] for details).
It has also been proved that this distance can be decomposed into two components: the first related
to the means of the histograms, and the second to their internal dispersions.

One of the main issues in clustering (and in multivariate analysis in general) is to take into
account the roles of the different variables. While the use of standard distances leads to
spherical groups (as in k-means), the main advantage of using adaptive distances is the
possibility of identifying clusters of different size (in terms of variability) and orientation in
the space (in terms of the main alignment of the cluster with one or more directions of a set of
variables). One way is to homogenize the variables by means of a standardization step. Another way
is to use an adaptive distance in the clustering algorithm that includes, in the optimization
process, the tuning of a set of weights to associate with each variable (for all the clusters or
within each cluster). De Carvalho and Lechevallier [14, 15] and de Souza and De Carvalho proposed
several adaptive distances for the dynamic clustering of interval and histogram data. In one of
these papers, a system of weights for each variable is introduced for the Euclidean distance. The
weights depend on the distance but the optimization process does not, so that the same schema can
easily be extended to the Wasserstein distance too.



In the present paper we propose two adaptive approaches for clustering histogram data, based
on the Wasserstein distance. This metric compares histogram data with respect to both
frequency and support, while the Euclidean distance compares histograms taking into account only
one component (frequency or support). Further, Irpino and Romano [18] showed that it is possible to
decompose the Wasserstein distance between two histograms into two (additive and independent)
components: the first is related to the locations of the histograms, while the second is related to
the different variability of the two histograms. In order to take advantage of this decomposition,
we propose the following approaches for the definition of adaptive distances.

In both approaches, we associate a weight with each variable and with each component into which
the distance is decomposed: in the first approach, the weights are globally estimated for all the
clusters at once; in the second, they are locally estimated for each cluster.

In order to provide tools for interpreting the clustering, we propose an extension of the classical
ratios based on the within (intra-cluster) sum of squares, the between (inter-cluster) sum of
squares and the total sum of squares, having proved the decomposition of the inertia of a set of
histogram data computed with the adaptive (squared) Wasserstein distances.

This paper is organized as follows: in Section 2 we introduce the definitions of histogram
data and of the Wasserstein distance for histograms. In Section 3, starting from the Dynamic
Clustering Algorithm with non-adaptive distances, we propose two schemas where the adequacy
criterion is based on adaptive squared Wasserstein distances for histogram data. In Section 4 we
introduce some tools for the interpretation of the clustering results. In Section 5 two applications
are shown: one using synthetic data, in order to prove the usefulness of the proposed methods
based on the variability structure of the data; the other using a real dataset, in order to
demonstrate the application in a real situation and to show how to interpret the results of a
classic clustering task on histogram data. Section 6 ends the paper with some conclusions and
perspectives about the proposed clustering methods.



2. Histogram data and Wasserstein distance 

The clustering of data expressed as histograms can be useful to discover typologies of
phenomena on the basis of the similarity of their distributions. In general, clustering techniques
depend on the choice of a suitable dissimilarity, where the adjective suitable refers to the
capability of the dissimilarity to take into account the nature of the data and of their
representation space. In this section, we give a definition of histogram data and we propose the
Wasserstein distance as a suitable metric for comparing them.

2.1. Histogram Data 

Histograms are a cheap way to represent aggregate data or empirical distributions.
Indeed, if no model of distribution can be assumed, the distribution of a set of data can be
represented by a histogram, i.e., a set of contiguous intervals (bins), of equal or different width,
associated with a set of weights (empirical frequencies or densities).
Formally, let $Y$ be a continuous variable defined on a finite support $S = [\min(y); \max(y)] \subset \mathbb{R}$.
The support $S$ is partitioned into a set of contiguous intervals (bins) $\{I_1, \ldots, I_h, \ldots, I_H\}$, where
$I_h = [a_h; b_h)$, with $\min(y) = a_1$ and $\max(y) = b_H$. Each $I_h$ is associated with a weight $\pi_h$ that
represents an empirical (or theoretical) relative frequency.
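
As an illustration of this definition, the following sketch evaluates the quantile function of a single histogram. It assumes that the mass of each bin is spread uniformly inside the bin, which is an assumption made here for the example; the function name `histogram_quantile` and the toy data are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def histogram_quantile(edges, weights, t):
    """Return F^{-1}(t) for a histogram with bin edges a_1 < ... < b_H.

    `weights` are the relative frequencies pi_h (summing to 1). Mass is
    assumed to be uniformly spread inside each bin (illustrative choice).
    """
    cumulative = list(accumulate(weights))  # distribution function at each right edge
    # First bin whose cumulative weight reaches t (clamped to the last bin).
    h = min(bisect_left(cumulative, t), len(weights) - 1)
    if weights[h] == 0:
        return edges[h]
    mass_before = cumulative[h] - weights[h]
    fraction = (t - mass_before) / weights[h]
    return edges[h] + fraction * (edges[h + 1] - edges[h])


# Two bins, [0, 1) and [1, 3], each holding half of the mass.
edges, weights = [0.0, 1.0, 3.0], [0.5, 0.5]
median = histogram_quantile(edges, weights, 0.5)
```

Because the distribution function is piecewise linear under this assumption, the quantile function is piecewise linear as well, with one piece per bin.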

Let $E$ be a set of $n$ empirical distributions $f_i(y)$ ($i = 1, \ldots, n$). In the case
of a histogram description, it is possible to assume that $S(i) = [\min(y_i); \max(y_i)]$. Considering a
set of intervals $I_{hi} = [\min(y_{hi}); \max(y_{hi}))$ such that:

$$I_{li} \cap I_{mi} = \emptyset \;\; (l \neq m) \quad \text{and} \quad \bigcup_{s=1}^{H_i} I_{si} = [\min(y_i); \max(y_i)],$$

the support can also be written as $S(i) = \{I_{1i}, \ldots, I_{ui}, \ldots, I_{H_i i}\}$.

In this paper, we denote with $f_i(y)$ the (empirical) density function associated with the description
$y_i$ and with $F_i(y)$ its distribution function. It is possible to define the description of the $i$-th
histogram for the variable $Y$ as:

$$y_i = \{(I_{1i}, \pi_{1i}), \ldots, (I_{ui}, \pi_{ui}), \ldots, (I_{H_i i}, \pi_{H_i i})\} \tag{1}$$

such that $\forall I_{ui} \in S(i)$: $\pi_{ui} = \int_{I_{ui}} f_i(y)\,dy \geq 0$ and $\int_{S(i)} f_i(y)\,dy = 1$.



In the following, we use $y_i$ to denote the description of the $i$-th histogram in the univariate case.
If we observe $p$ variables, we denote with $y_{ij}$ (where $i = 1, \ldots, n$ and $j = 1, \ldots, p$) the $i$-th histogram
for the variable $j$. Thus, following the classic data analysis approach, the individual×variable
input data table contains a histogram in each cell, as represented in Table 1.



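Objs.   Var 1      ...   Var j      ...   Var p
1       $y_{11}$   ...   $y_{1j}$   ...   $y_{1p}$
...     ...        ...   ...        ...   ...
i       $y_{i1}$   ...   $y_{ij}$   ...   $y_{ip}$
...     ...        ...   ...        ...   ...
n       $y_{n1}$   ...   $y_{nj}$   ...   $y_{np}$

Table 1: Individual×variable input data table of histogram data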



2.2. The Wasserstein-Kantorovich distance between two histograms 

The comparison of histogram data is a particular case of the comparison of distribution functions.
Several distances and dissimilarities have been presented in the literature. In [7] we introduce
several distances that can be used for comparing histogram data. Another good review of
distances between distributions can be found in [10]. Most of these distances are based
on the comparison of the densities or frequencies associated with a random variable, but the
family of distances based on the Wasserstein metric yields interesting interpretative results
about the characteristics or the moments of the distributions (see [7] for details). Wasserstein-
Kantorovich metric 1, 1 0.1 1,1 9,1 . and, in particular the derived L^-Wasserstein distance is a natural 
extension of the Euclidean metric to compare distributions. If $F_i(y)$ and $F_{i'}(y)$ are the (empirical)
distribution functions associated with $y_i$ and $y_{i'}$, respectively, and $F_i^{-1}(t)$ and $F_{i'}^{-1}(t)$ ($t \in [0, 1]$)
their corresponding quantile functions, the $L_2$-Wasserstein metric is defined as follows:

$$d_W(y_i, y_{i'}) := \sqrt{\int_0^1 \left(F_i^{-1}(t) - F_{i'}^{-1}(t)\right)^2 dt}. \tag{2}$$

This distance is also known as Mallows' [12] distance. If the corresponding quantile functions
are centered by the respective sample means, $(F_i^c)^{-1}(t) = F_i^{-1}(t) - \bar{y}_i$ and
$(F_{i'}^c)^{-1}(t) = F_{i'}^{-1}(t) - \bar{y}_{i'}$ ($\forall t \in [0, 1]$), the corresponding histogram descriptions are
denoted with $y_i^c$ and $y_{i'}^c$, and Cuesta-Albertos et al. proved that

$$d_W^2(y_i, y_{i'}) = (\bar{y}_i - \bar{y}_{i'})^2 + d_W^2(y_i^c, y_{i'}^c) \tag{3}$$

or, in other words, the (squared) Wasserstein distance between two distributions $f_i(y)$ and $f_{i'}(y)$
(or random variables) is equal to the sum of the squared Euclidean distance between their means
(the first moments) and the squared Wasserstein distance between the two centered random
variables. The latter can be considered as a distance measure of their dispersions, i.e., it represents
the difference between two distributions except for their location.
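
The decomposition in Eq. (3) can be checked numerically. The sketch below assumes that each distribution is represented by its quantile function sampled at the same equally spaced levels, so that the integral in Eq. (2) is approximated by an average; the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sq_wasserstein(qa, qb):
    """Squared L2 Wasserstein distance between two sampled quantile functions."""
    return sum((a - b) ** 2 for a, b in zip(qa, qb)) / len(qa)


def centered(q):
    """Quantile function centered by its own mean."""
    m = mean(q)
    return [v - m for v in q]


def location_and_dispersion(qa, qb):
    """The two additive components of Eq. (3)."""
    location = (mean(qa) - mean(qb)) ** 2
    dispersion = sq_wasserstein(centered(qa), centered(qb))
    return location, dispersion


qa = [0.0, 1.0, 2.0, 3.0]
qb = [2.0, 4.0, 6.0, 8.0]
location, dispersion = location_and_dispersion(qa, qb)
```

For sampled quantile functions the identity holds exactly (up to floating-point error), not only approximately, because it reduces to the usual decomposition of a mean of squares.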



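Irpino and Romano [18] and Verde and Irpino [6] showed that the $L_2$ Wasserstein distance
between two generic distributions $F_i(y)$ and $F_{i'}(y)$ can be decomposed into the following
components (location, size and shape):

$$d_W^2(y_i, y_{i'}) = \underbrace{(\bar{y}_i - \bar{y}_{i'})^2}_{\text{location}} + \underbrace{(s_i - s_{i'})^2}_{\text{size}} + \underbrace{2 s_i s_{i'} \left[1 - r_{QQ}(F_i^{-1}, F_{i'}^{-1})\right]}_{\text{shape}} \tag{4}$$

where $\bar{y}_i$, $\bar{y}_{i'}$ and $s_i$, $s_{i'}$ are the sample means and standard deviations, respectively, of $f_i(y)$
and $f_{i'}(y)$, and

$$r_{QQ}(F_i^{-1}, F_{i'}^{-1}) = \frac{\int_0^1 \left(F_i^{-1}(t) - \bar{y}_i\right)\left(F_{i'}^{-1}(t) - \bar{y}_{i'}\right) dt}{s_i s_{i'}} = \frac{\int_0^1 F_i^{-1}(t) F_{i'}^{-1}(t)\,dt - \bar{y}_i \bar{y}_{i'}}{s_i s_{i'}} \tag{5}$$

is the sample correlation of the quantiles of the two empirical distributions as represented in a
classical QQ plot.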
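
A small numerical check of Eqs. (4) and (5) is sketched below, again assuming quantile functions sampled at common, equally spaced levels and non-degenerate distributions (strictly positive standard deviations). The function name is hypothetical.

```python
import math


def wasserstein_components(qa, qb):
    """Location, size and shape components of Eq. (4) for sampled quantiles."""
    m = len(qa)
    mean_a, mean_b = sum(qa) / m, sum(qb) / m
    ca = [v - mean_a for v in qa]
    cb = [v - mean_b for v in qb]
    sd_a = math.sqrt(sum(v * v for v in ca) / m)
    sd_b = math.sqrt(sum(v * v for v in cb) / m)
    # Correlation between the quantiles, as read on a QQ plot (Eq. (5)).
    r_qq = sum(a * b for a, b in zip(ca, cb)) / (m * sd_a * sd_b)
    location = (mean_a - mean_b) ** 2
    size = (sd_a - sd_b) ** 2
    shape = 2.0 * sd_a * sd_b * (1.0 - r_qq)
    return location, size, shape


qa = [0.0, 1.0, 2.0, 3.0]
qb = [0.0, 0.0, 1.0, 9.0]
location, size, shape = wasserstein_components(qa, qb)
direct = sum((a - b) ** 2 for a, b in zip(qa, qb)) / len(qa)
```

The shape term vanishes only when the two quantile functions are perfectly linearly related, that is, when the two distributions differ in location and scale alone.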

A computational problem is related to the calculation of $r_{QQ}$, because it requires the computation
of the quantile functions. Irpino and Verde showed that the computation of the Wasserstein distance
between histogram data depends only on the number of bins used for the histogram descriptions,
avoiding the computational drawbacks related to the identification of the quantile functions of
continuous distributions.

The Wasserstein distance can be used for defining an inertia measure among histograms, as for
the Euclidean metric [8]. The total inertia of a set of $n$ histogram data with respect to the
barycenter $g_E$ (which is a histogram whose quantiles are the means of the respective quantiles of
all the distributions) is given by the following quantity:

$$T = \sum_{i=1}^{n} d_W^2(y_i, g_E) = \sum_{k=1}^{K} \sum_{i \in C_k} d_W^2(y_i, g_k) + \sum_{k=1}^{K} |C_k|\, d_W^2(g_k, g_E) = W + B \tag{6}$$

i.e., $T$ can be decomposed into within ($W$) and between ($B$) clusters inertia, according to
Huygens' theorem of decomposition, where $g_k$ is the barycenter of the $k$-th cluster and $|C_k|$ is
the number of objects in $E$ belonging to the cluster $C_k$. We will use this property for defining
tools for the interpretation of clustering results.
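
The decomposition of the total inertia into within and between inertia can be verified on a toy example. The sketch assumes quantile functions sampled at common levels and a given partition; the names and the data are hypothetical.

```python
def sq_wasserstein(qa, qb):
    """Squared L2 Wasserstein distance between two sampled quantile functions."""
    return sum((a - b) ** 2 for a, b in zip(qa, qb)) / len(qa)


def barycenter(quantile_vectors):
    """Histogram whose quantiles are the means of the respective quantiles."""
    return [sum(level) / len(level) for level in zip(*quantile_vectors)]


def inertia_decomposition(data, labels):
    """Return total, within and between inertia for a given partition."""
    g_all = barycenter(data)
    total = sum(sq_wasserstein(q, g_all) for q in data)
    within = between = 0.0
    for k in set(labels):
        members = [q for q, lab in zip(data, labels) if lab == k]
        g_k = barycenter(members)
        within += sum(sq_wasserstein(q, g_k) for q in members)
        between += len(members) * sq_wasserstein(g_k, g_all)
    return total, within, between


data = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [10.0, 12.0, 14.0], [11.0, 12.0, 16.0]]
labels = [0, 0, 1, 1]
total, within, between = inertia_decomposition(data, labels)
```

The equality between the total and the sum of the within and between terms holds for any partition; a good partition is one for which the within term is small relative to the total.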



2.2.1. Multivariate and adaptive Wasserstein distance 

Given a set $E$ of $n$ objects described by $p$ variables as in Table 1, each $y_{ij}$ is associated with an
(empirical) density function $f_i(y_j)$, a distribution function $F_i(y_j)$ and a quantile function $F_{ij}^{-1}(t_j)$.
We denote with $\bar{y}_{ij}$ and $s_{ij}$ the sample mean and the sample standard deviation of $f_i(y_j)$. The
individual description of the $i$-th object is then the following vector $\mathbf{y}_i$:

$$\mathbf{y}_i = (y_{i1}, \ldots, y_{ip}). \tag{7}$$



Clark and Rae [21] present specific formulations for the multivariate Wasserstein distance when
the distributions are Gaussian and equipped with their covariance matrix. For all the other
cases [20], it is not possible to compute analytically a Wasserstein distance between two
multivariate distributions. Considering that we do not know the joint histograms for each individual
but only the marginal histograms, we consider the multivariate squared Wasserstein distance as
follows:

$$d_W^2(\mathbf{y}_i, \mathbf{y}_{i'}) = \sum_{j=1}^{p} d_W^2(y_{ij}, y_{i'j}). \tag{8}$$

Following the same observations derived for Eqn. (3), we may rewrite the multivariate squared
Wasserstein distance as follows:

$$d_W^2(\mathbf{y}_i, \mathbf{y}_{i'}) = \sum_{j=1}^{p} (\bar{y}_{ij} - \bar{y}_{i'j})^2 + \sum_{j=1}^{p} d_W^2(y_{ij}^c, y_{i'j}^c). \tag{9}$$

In order to give different weights to the variables we introduce adaptive distances [22]:
these are distances equipped with a system of weights. Let us consider a vector of weights
$\Lambda = (\lambda^1, \ldots, \lambda^p)$ such that $\lambda^j > 0$. According to [22] and [14], a general formulation for an
Adaptive Single Variable (squared) Wasserstein distance is as follows:

$$d(\mathbf{y}_i, \mathbf{y}_{i'} \mid \Lambda) = \sum_{j=1}^{p} \lambda^j\, d_W^2(y_{ij}, y_{i'j}). \tag{10}$$
In this formulation, the weights induce a linear transformation of the original space. Several 
approaches of these types have been proposed (see for examples iiil, UM, 1122), where the 
weights are associated with the whole set of data or where the weights are chosen locally for each
cluster into which the set of data is partitioned. If the distance is decomposable into several
components, it is possible to introduce a suitable system of weights for such components. While
the choice of weighting each variable is easily extendible to histogram data, the choice of
weighting components needs to be proven for this kind of data. In the present paper, starting
from the decomposition of the squared Wasserstein distance shown in Eqs. (3) and (4), we
propose two schemas of weighting systems for the definition of two adaptive squared Wasserstein
distances based on the two components: the first is denoted as Globally Component-wise Adaptive
Wasserstein Distance (GC-AWD), while the second is denoted as Cluster Dependent Component-wise
Adaptive Wasserstein Distance (CDC-AWD).

3. The Dynamic Clustering Algorithm 

The Dynamic Clustering Algorithm (DCA) J^llSl is here proposed as a method to partition 
a set of data described by distributions. We recall that DCA is based on the definition of a 
criterion of the best fitting between the partition of a set of individuals and the representation 
of the clusters of the partition. The algorithm simultaneously looks for the best partition into K 
clusters and their best representation. Thus, the DCA needs the definition of a proximity function 
to assign the individuals to the clusters and the definition of a way to represent the clusters. 

The choice of the representative elements of the clusters (prototypes) is made according to the
dissimilarity function used in the algorithm to allocate the elements to the clusters, such that a
criterion of internal homogeneity is minimized. The consistency between the representation and
the allocation function guarantees the convergence of the algorithm to a stationary value of the
criterion.

According to the nature of the data, we propose to base the adequacy criterion for the dynamic
clustering algorithm on the Wasserstein distance between histogram data. We introduce two
schemas of Dynamic Clustering algorithms based on two proposals of adaptive (squared)
Wasserstein distances. In these algorithms, the adequacy criterion is defined through systems of
weights computed either on the whole set E, for the Globally Component-wise Adaptive Wasserstein
Distance (GC-AWD), or within each cluster, for the Cluster Dependent Component-wise Adaptive
Wasserstein Distance (CDC-AWD).

3.1. Adequacy criterion based on standard and adaptive distances 

Let us consider a set $E$ of $n$ objects described by $p$ histogram variables. The individual
description of the $i$-th object is $\mathbf{y}_i = (y_{i1}, \ldots, y_{ip})$, where $y_{ij}$ is the histogram description of the
$i$-th individual for the $j$-th variable. We assume that the prototype of the cluster $C_k$
($k = 1, \ldots, K$) is also represented by a vector $\mathbf{g}_k = (g_{k1}, \ldots, g_{kp})$, where $g_{kj}$ is a histogram. As in
the standard adaptive dynamic cluster algorithm, the proposed methods look for the partition
$P = (C_1, \ldots, C_K)$ of $E$ in $K$ classes, its corresponding set of $K$ prototypes $G = (\mathbf{g}_1, \ldots, \mathbf{g}_K)$ and
a set of $K$ different adaptive distances $d = (d_1, \ldots, d_K)$ depending on a set $\Lambda$ of positive weights
associated with the clusters, such that the following adequacy criterion of the best fitting between
the clusters and their representation is locally minimized:

$$\Delta(G, \Lambda, P) = \sum_{k=1}^{K} \sum_{i \in C_k} d_k(\mathbf{y}_i, \mathbf{g}_k \mid \Lambda). \tag{11}$$

As in the standard adaptive dynamic cluster algorithm, we perform a representation step
where the prototypes are updated according to the cluster structure of the data. This is followed
by a weighting step, allowing the definition of the weight to be associated with each variable
(or component) in the definition of the adaptive dissimilarity. Subsequently, an allocation step
assigns the individuals to classes according to their proximity to the class prototypes. The three
steps are repeated until convergence of the algorithm is achieved, i.e., until the adequacy criterion
reaches a stationary value.

The adequacy criteria to be minimized are based on the following distances:

STANDARD - Standard Wasserstein distance. The standard (squared) Wasserstein distance
between the histogram description $\mathbf{y}_i$ and the prototype $\mathbf{g}_k$ is defined as:

$$d(\mathbf{y}_i, \mathbf{g}_k) = \sum_{j=1}^{p} d_W^2(y_{ij}, g_{kj}). \tag{12}$$

In this case, the general criterion becomes:

$$\Delta(G, P) = \sum_{k=1}^{K} \sum_{i \in C_k} d(\mathbf{y}_i, \mathbf{g}_k) \tag{13}$$

as no system of weights is defined.

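GC-AWD - Globally Component-wise Adaptive Wasserstein Distance. The distance depends
on two vectors $\Lambda_{\bar{y}}$ and $\Lambda_{Disp}$ of coefficients that assign weights to each component of
each variable:

$$\Lambda_{\bar{y}} = (\lambda^1_{\bar{y}}, \ldots, \lambda^p_{\bar{y}}) \tag{14}$$

$$\Lambda_{Disp} = (\lambda^1_{Disp}, \ldots, \lambda^p_{Disp}). \tag{15}$$

We define the two-component global adaptive Wasserstein distance between the description
$\mathbf{y}_i$ and the prototype $\mathbf{g}_k$ as:

$$d(\mathbf{y}_i, \mathbf{g}_k \mid \Lambda) = \sum_{j=1}^{p} \lambda^j_{\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2 + \sum_{j=1}^{p} \lambda^j_{Disp}\, d_W^2(y^c_{ij}, g^c_{kj}) \tag{16}$$

where $y^c_{ij}$ and $g^c_{kj}$ are the centered descriptions of $y_{ij}$ and of $g_{kj}$, as presented in Eqn. (3). In
this case, being

$$\Lambda = \begin{bmatrix} \Lambda_{\bar{y}} \\ \Lambda_{Disp} \end{bmatrix} = \begin{bmatrix} \lambda^1_{\bar{y}} & \cdots & \lambda^p_{\bar{y}} \\ \lambda^1_{Disp} & \cdots & \lambda^p_{Disp} \end{bmatrix} \tag{17}$$

the general criterion is:

$$\Delta(G, \Lambda, P) = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \lambda^j_{\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2 + \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \lambda^j_{Disp}\, d_W^2(y^c_{ij}, g^c_{kj}). \tag{18}$$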

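CDC-AWD - Cluster Dependent Component-wise Adaptive Wasserstein Distance. The distance
depends on $K$ couples of vectors $\Lambda_{k,\bar{y}}$ and $\Lambda_{k,Disp}$ of coefficients that assign a weight to
each component of each variable in each cluster. For each cluster we define:

$$\Lambda_{k,\bar{y}} = (\lambda^1_{k,\bar{y}}, \ldots, \lambda^p_{k,\bar{y}}), \quad \Lambda_{k,Disp} = (\lambda^1_{k,Disp}, \ldots, \lambda^p_{k,Disp}) \quad \forall k \in \{1, \ldots, K\}.$$

We define the two-component cluster-dependent adaptive Wasserstein distance between
the description $\mathbf{y}_i$ and the prototype $\mathbf{g}_k$ as:

$$d(\mathbf{y}_i, \mathbf{g}_k \mid \Lambda_k) = \sum_{j=1}^{p} \lambda^j_{k,\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2 + \sum_{j=1}^{p} \lambda^j_{k,Disp}\, d_W^2(y^c_{ij}, g^c_{kj}) \tag{19}$$

where $y^c_{ij}$ and $g^c_{kj}$ are the centered descriptions of $y_{ij}$ and of $g_{kj}$, as presented in Eqn. (3). In
this case, being

$$\Lambda = \begin{bmatrix} \Lambda_{1,\bar{y}} \\ \Lambda_{1,Disp} \\ \vdots \\ \Lambda_{K,\bar{y}} \\ \Lambda_{K,Disp} \end{bmatrix} \tag{20}$$

the general criterion becomes:

$$\Delta(G, \Lambda, P) = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \lambda^j_{k,\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2 + \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \lambda^j_{k,Disp}\, d_W^2(y^c_{ij}, g^c_{kj}). \tag{21}$$

From an initial solution for $(G, \Lambda, P)$, the dynamic clustering algorithm based on adaptive
distances alternates three steps until the criterion $\Delta$ reaches a stationary point.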



3.1.1. Step 1: definition of the prototypes 

In the first step of the algorithm, the partition $P$ of $E$ in $K$ clusters and the corresponding
weights $\Lambda$ are fixed.

Proposition 1. Having chosen one of the distance functions (Eqs. (12), (16) or (19)), the vector of
prototypes $G = (\mathbf{g}_1, \ldots, \mathbf{g}_K)$, where $\mathbf{g}_k = (g_{k1}, \ldots, g_{kp})$, which minimizes the criterion $\Delta$, is
calculated by means of the quantile functions associated with each $g_{kj}$. Each $g_{kj}$ is represented by a
histogram associated with the following quantile function:

$$F_{g_{kj}}^{-1}(t_j) = \frac{1}{n_k} \sum_{i \in C_k} F_{ij}^{-1}(t_j) \quad \forall t_j \in [0, 1] \tag{22}$$

where $n_k$ is the cardinality of cluster $C_k$. It is the mean of the quantile functions of the histogram
representations of the elements belonging to the cluster $C_k$, and it can also be written as:

$$(F^c_{g_{kj}})^{-1}(t_j) + \bar{y}_{g_{kj}} = \frac{1}{n_k} \sum_{i \in C_k} (F^c_{ij})^{-1}(t_j) + \frac{1}{n_k} \sum_{i \in C_k} \bar{y}_{ij} \quad \forall t_j \in [0, 1] \tag{23}$$

where $(F^c_{g_{kj}})^{-1}(t_j)$ (resp. $(F^c_{ij})^{-1}(t_j)$) is associated with the centered description $g^c_{kj}$ (resp. $y^c_{ij}$) of
$g_{kj}$ (resp. $y_{ij}$) and $\bar{y}_{g_{kj}}$ (resp. $\bar{y}_{ij}$) is the mean of the description $g_{kj}$ (resp. $y_{ij}$).
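Proof. Since $\Delta$ is an additive criterion (over the variables, the components and the clusters),
according to [20] we can write:

$$\underbrace{\min_{g_{kj}} \sum_{i \in C_k} d_W^2(y_{ij}, g_{kj})}_{I} = \min \Bigg\{ \underbrace{\sum_{i \in C_k} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2}_{II} + \underbrace{\sum_{i \in C_k} \int_0^1 \left( (F^c_{ij})^{-1}(t_j) - (F^c_{g_{kj}})^{-1}(t_j) \right)^2 dt_j}_{III} \Bigg\}$$

where $(F^c_{ij})^{-1}(t_j) = F_{ij}^{-1}(t_j) - \bar{y}_{ij}$ and $(F^c_{g_{kj}})^{-1}(t_j) = F_{g_{kj}}^{-1}(t_j) - \bar{y}_{g_{kj}}$. Problem II is minimized when:

$$\bar{y}_{g_{kj}} = \frac{1}{n_k} \sum_{i \in C_k} \bar{y}_{ij}$$

while problem III is minimized when, for each $t_j \in [0, 1]$:

$$(F^c_{g_{kj}})^{-1}(t_j) = \frac{1}{n_k} \sum_{i \in C_k} (F^c_{ij})^{-1}(t_j).$$

Then, problem I is minimized when the barycenter (prototype) histogram $g_{kj}$ is a histogram
whose quantile function is:

$$F_{g_{kj}}^{-1}(t_j) = (F^c_{g_{kj}})^{-1}(t_j) + \bar{y}_{g_{kj}}. \tag{24}$$
□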
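
The representation step for one cluster and one variable can be sketched as follows, assuming that every histogram is summarized by its quantile function sampled at the same levels (an assumption of this example; the names are hypothetical). The final lines check the equivalent form of Eq. (23): the prototype is also the mean of the centered quantile functions shifted by the mean of the means.

```python
def mean(values):
    return sum(values) / len(values)


def prototype(cluster):
    """Mean quantile function of the cluster members (Eq. (22))."""
    return [mean(level) for level in zip(*cluster)]


def centered(q):
    m = mean(q)
    return [v - m for v in q]


cluster = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0], [1.0, 6.0, 8.0]]
g = prototype(cluster)

# Eq. (23): mean of the centered quantile functions plus mean of the means.
g_centered = prototype([centered(q) for q in cluster])
g_mean = mean([mean(q) for q in cluster])
g_from_components = [v + g_mean for v in g_centered]
```

Since the two components are minimized by the same histogram, the representation step does not depend on the weights, which is why the same update serves the standard, GC-AWD and CDC-AWD criteria.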



3.1.2. Step 2: definition of the best distances 

In the second step, the partition P of E and the corresponding vector G of the prototypes are 
fixed. It is worth noting that the Dynamic Clustering Algorithm based on the standard (squared) 
Wasserstein distances does not require this step. 



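Proposition 2. According to Diday and Govaert [22], the system of weights $\Lambda$ that defines
the adaptive distances minimizing the criterion $\Delta$ is calculated according to the
scheme of adaptive distance function used for the definition of the adequacy criterion of the
algorithm.

GC-AWD If the distance function is given by Eqn. (16), the system of weights $\Lambda$ that minimizes
the criterion $\Delta$ is

$$\Lambda = \begin{bmatrix} \Lambda_{\bar{y}} \\ \Lambda_{Disp} \end{bmatrix}, \quad \Lambda_{\bar{y}} = (\lambda^1_{\bar{y}}, \ldots, \lambda^p_{\bar{y}}), \quad \Lambda_{Disp} = (\lambda^1_{Disp}, \ldots, \lambda^p_{Disp}) \tag{25}$$

under the following restrictions:

1. $\lambda^j_{\bar{y}} > 0$ and $\lambda^j_{Disp} > 0$;
2. $\prod_{j=1}^{p} \lambda^j_{\bar{y}} = 1$ and $\prod_{j=1}^{p} \lambda^j_{Disp} = 1$.

The coefficients of $\Lambda_{\bar{y}}$ and $\Lambda_{Disp}$, which satisfy restrictions (1) and (2) and minimize the
criterion $\Delta$, are calculated as:

$$\lambda^j_{\bar{y}} = \frac{\left[ \prod_{h=1}^{p} \sum_{k=1}^{K} \sum_{i \in C_k} (\bar{y}_{ih} - \bar{y}_{g_{kh}})^2 \right]^{1/p}}{\sum_{k=1}^{K} \sum_{i \in C_k} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2} \tag{26}$$

$$\lambda^j_{Disp} = \frac{\left[ \prod_{h=1}^{p} \sum_{k=1}^{K} \sum_{i \in C_k} d_W^2(y^c_{ih}, g^c_{kh}) \right]^{1/p}}{\sum_{k=1}^{K} \sum_{i \in C_k} d_W^2(y^c_{ij}, g^c_{kj})} \tag{27}$$

Note that the closer the objects are to the prototypes of the clusters for the component
related to the mean (resp. dispersion), the higher is the respective weight.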

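CDC-AWD If the distance function is given by Eqn. (19), the system of weights $\Lambda$ that minimizes
the criterion $\Delta$ is:

$$\Lambda = \begin{bmatrix} \Lambda_{1,\bar{y}} \\ \Lambda_{1,Disp} \\ \vdots \\ \Lambda_{k,\bar{y}} \\ \Lambda_{k,Disp} \\ \vdots \\ \Lambda_{K,\bar{y}} \\ \Lambda_{K,Disp} \end{bmatrix} \tag{28}$$

where, for each cluster $k = 1, \ldots, K$:

$$\Lambda_{k,\bar{y}} = (\lambda^1_{k,\bar{y}}, \ldots, \lambda^p_{k,\bar{y}}), \quad \Lambda_{k,Disp} = (\lambda^1_{k,Disp}, \ldots, \lambda^p_{k,Disp})$$

under the following restrictions for each cluster $k = 1, \ldots, K$:

1. $\lambda^j_{k,\bar{y}} > 0$ and $\lambda^j_{k,Disp} > 0$;
2. $\prod_{j=1}^{p} \lambda^j_{k,\bar{y}} = 1$ and $\prod_{j=1}^{p} \lambda^j_{k,Disp} = 1$.

The coefficients $\lambda^j_{k,\bar{y}}$ and $\lambda^j_{k,Disp}$, which satisfy restrictions (1) and (2) and minimize the
criterion $\Delta$, are:

$$\lambda^j_{k,\bar{y}} = \frac{\left[ \prod_{h=1}^{p} \sum_{i \in C_k} (\bar{y}_{ih} - \bar{y}_{g_{kh}})^2 \right]^{1/p}}{\sum_{i \in C_k} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2} \tag{29}$$

and

$$\lambda^j_{k,Disp} = \frac{\left[ \prod_{h=1}^{p} \sum_{i \in C_k} d_W^2(y^c_{ih}, g^c_{kh}) \right]^{1/p}}{\sum_{i \in C_k} d_W^2(y^c_{ij}, g^c_{kj})} \tag{30}$$

Note that the closer the objects are to the prototype of a given cluster $C_k$ for the component
related to the mean (resp. dispersion), the higher is the respective weight for the cluster
$C_k$.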
Proof. The proof of the proposition can be performed using the Lagrange multipliers method
for the minimization of the criterion. For the sake of brevity, and without loss
of generality, see [10] and [13]. □
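
The weighting step has the same closed form for every component: each weight is the geometric mean of the per-variable within sums divided by the within sum of the variable at hand. The sketch below assumes strictly positive sums and uses a hypothetical helper name; applied to sums taken over all clusters it corresponds to the GC-AWD case, applied to sums taken within one cluster it corresponds to the CDC-AWD case.

```python
import math


def component_weights(within_sums):
    """Weights with product equal to 1 that minimize sum_j lambda_j * S_j.

    `within_sums[j]` is the within sum S_j of one component (means or
    dispersions) for variable j; all S_j are assumed strictly positive.
    """
    p = len(within_sums)
    geometric_mean = math.prod(within_sums) ** (1.0 / p)
    return [geometric_mean / s for s in within_sums]


within_sums = [1.0, 4.0]          # toy values for p = 2 variables
weights = component_weights(within_sums)
weighted_criterion = sum(w * s for w, s in zip(weights, within_sums))
```

With these weights every variable contributes the same amount to the weighted criterion, and by the inequality between arithmetic and geometric means that amount is never larger than what unit weights would give.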



3.1.3. Step 3: definition of the best partition 
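In this step, the vector of prototypes $G$ and the weights $\Lambda$ are fixed.

Proposition 3. The partition $P = (C_1, \ldots, C_K)$, which minimizes the criterion $\Delta$, consists of
clusters $C_k$ ($k = 1, \ldots, K$) identified according to the following allocation rules:

STANDARD

$$C_k = \{ i \in E \mid d(\mathbf{y}_i, \mathbf{g}_k) < d(\mathbf{y}_i, \mathbf{g}_m) \} \quad \forall m \neq k \; (m = 1, \ldots, K). \tag{31}$$

GC-AWD and CDC-AWD

$$C_k = \{ i \in E \mid d(\mathbf{y}_i, \mathbf{g}_k \mid \Lambda) < d(\mathbf{y}_i, \mathbf{g}_m \mid \Lambda) \} \quad \forall m \neq k \; (m = 1, \ldots, K). \tag{32}$$

In general, when $d(\mathbf{y}_i, \mathbf{g}_k) = d(\mathbf{y}_i, \mathbf{g}_m)$ or $d(\mathbf{y}_i, \mathbf{g}_k \mid \Lambda) = d(\mathbf{y}_i, \mathbf{g}_m \mid \Lambda)$, then $i \in C_k$ if $k < m$.
Proof. The proof of Proposition 3 is straightforward. □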
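
The allocation rule, including the tie-breaking convention, can be sketched as below. The input is assumed to be a precomputed table of distances between each object and each prototype; the function name and the values are hypothetical.

```python
def allocate(distances_to_prototypes):
    """Assign each object to its closest prototype.

    distances_to_prototypes[i][k] holds d(y_i, g_k) or d(y_i, g_k | Lambda).
    """
    labels = []
    for row in distances_to_prototypes:
        # min() returns the first minimal index, so ties go to the smallest k.
        labels.append(min(range(len(row)), key=row.__getitem__))
    return labels


distances = [
    [0.2, 1.5, 3.0],
    [2.0, 0.7, 0.7],   # tie between clusters 1 and 2: cluster 1 is chosen
    [4.0, 4.0, 1.0],
]
labels = allocate(distances)
```

Resolving ties toward the smallest index keeps the allocation deterministic.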

3.2. Properties of the algorithm 

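The convergence of the algorithm can be studied from two
series: $z_t = (G^{(t)}, \Lambda^{(t)}, P^{(t)})$ and $ss_t = \Delta(z_t)$, $t = 0, 1, \ldots$

From an initial term $z_0 = (G^{(0)}, \Lambda^{(0)}, P^{(0)})$, the algorithm computes the different terms of the series
$z_t$ until convergence, at which point the criterion $\Delta$ achieves a stationary value. Here, we show
the convergence of the algorithm in terms of the configuration of partition, prototypes and criterion
$\Delta(G, \Lambda, P)$.

Proposition 4. The series $ss_t = \Delta(z_t)$ decreases at each iteration and converges.
Proof. First, we show that the inequalities (I), (II) and (III)

$$\underbrace{\Delta(G^{(t)}, \Lambda^{(t)}, P^{(t)})}_{ss_t} \overset{(I)}{\geq} \Delta(G^{(t+1)}, \Lambda^{(t)}, P^{(t)}) \overset{(II)}{\geq} \Delta(G^{(t+1)}, \Lambda^{(t+1)}, P^{(t)}) \overset{(III)}{\geq} \underbrace{\Delta(G^{(t+1)}, \Lambda^{(t+1)}, P^{(t+1)})}_{ss_{t+1}} \tag{33}$$

hold.

Inequality (I) holds because

$$\Delta(G^{(t)}, \Lambda^{(t)}, P^{(t)}) = \sum_{k=1}^{K} \sum_{i \in C_k^{(t)}} d(\mathbf{y}_i, \mathbf{g}_k^{(t)} \mid \Lambda^{(t)})$$

$$\Delta(G^{(t+1)}, \Lambda^{(t)}, P^{(t)}) = \sum_{k=1}^{K} \sum_{i \in C_k^{(t)}} d(\mathbf{y}_i, \mathbf{g}_k^{(t+1)} \mid \Lambda^{(t)})$$

and according to Proposition 1

$$\mathbf{g}_k^{(t+1)} = \arg\min_{\mathbf{g}} \sum_{i \in C_k^{(t)}} d(\mathbf{y}_i, \mathbf{g} \mid \Lambda^{(t)}) \quad (k = 1, \ldots, K).$$

Inequality (II) also holds because

$$\Delta(G^{(t+1)}, \Lambda^{(t+1)}, P^{(t)}) = \sum_{k=1}^{K} \sum_{i \in C_k^{(t)}} d(\mathbf{y}_i, \mathbf{g}_k^{(t+1)} \mid \Lambda^{(t+1)})$$

and according to Proposition 2

$$\Lambda^{(t+1)} = \arg\min_{\Lambda} \sum_{k=1}^{K} \sum_{i \in C_k^{(t)}} d(\mathbf{y}_i, \mathbf{g}_k^{(t+1)} \mid \Lambda).$$

Finally, inequality (III) holds because

$$\Delta(G^{(t+1)}, \Lambda^{(t+1)}, P^{(t+1)}) = \sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} d(\mathbf{y}_i, \mathbf{g}_k^{(t+1)} \mid \Lambda^{(t+1)})$$

and according to Proposition 3

$$C_k^{(t+1)} = \arg\min_{C \in \mathcal{P}(E)} \sum_{i \in C} d(\mathbf{y}_i, \mathbf{g}_k^{(t+1)} \mid \Lambda^{(t+1)}) \quad (k = 1, \ldots, K). \tag{34}$$

Finally, because the series $ss_t$ decreases and it is bounded ($\Delta(z_t) \geq 0$), it converges. □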
Proposition 5. The series $z_t = (G^{(t)}, \Lambda^{(t)}, P^{(t)})$ converges.

Proof. Let us assume that the stationary point of the series $ss_t$ is achieved at the iteration $t = T$.
Then, we have that $ss_T = ss_{T+1}$, i.e., $\Delta(z_T) = \Delta(z_{T+1})$.

From $\Delta(z_T) = \Delta(z_{T+1})$, we have $\Delta(G^{(T)}, \Lambda^{(T)}, P^{(T)}) = \Delta(G^{(T+1)}, \Lambda^{(T+1)}, P^{(T+1)})$ and this equality,
according to Proposition 4, can be rewritten as equalities (I), (II) and (III) as follows:

$$\Delta(G^{(T)}, \Lambda^{(T)}, P^{(T)}) \overset{(I)}{=} \Delta(G^{(T+1)}, \Lambda^{(T)}, P^{(T)}) \overset{(II)}{=} \Delta(G^{(T+1)}, \Lambda^{(T+1)}, P^{(T)}) \overset{(III)}{=} \Delta(G^{(T+1)}, \Lambda^{(T+1)}, P^{(T+1)}). \tag{35}$$

From the first equality (I), we know that $G^{(T)} = G^{(T+1)}$ because $G$ is unique, minimizing $\Delta$ when
the partition $P^{(T)}$ and $\Lambda^{(T)}$ are fixed. From the second equality (II), we know that $\Lambda^{(T)} = \Lambda^{(T+1)}$
because $\Lambda$ is unique, minimizing $\Delta$ when the partition $P^{(T)}$ and $G^{(T+1)}$ are fixed. Then, from the
third equality (III), we know that $P^{(T)} = P^{(T+1)}$ because $P$ is unique, minimizing $\Delta$ when $\Lambda^{(T+1)}$ and
$G^{(T+1)}$ are fixed.

Finally, we conclude that $z_T = z_{T+1}$. This conclusion holds for all $t \geq T$, so $z_t = z_T$, $\forall t \geq T$,
and then $z_t$ converges. □
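
The three steps and the monotone behaviour of the criterion stated in Proposition 4 can be illustrated with a compact sketch of the GC-AWD scheme. It is a minimal illustration under several assumptions that are not part of the original procedure: histograms are represented by quantile functions sampled at common levels, the initial partition is deterministic rather than random, an empty cluster keeps its previous prototype, and the weights are left unchanged whenever a within sum is zero. All names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def components(q, g):
    """Squared mean difference and squared distance between centered parts."""
    mq, mg = mean(q), mean(g)
    disp = sum(((a - mq) - (b - mg)) ** 2 for a, b in zip(q, g)) / len(q)
    return (mq - mg) ** 2, disp


def distance(y, g, lam_mean, lam_disp):
    """Two-component adaptive distance between an object and a prototype."""
    total = 0.0
    for j, (q, gq) in enumerate(zip(y, g)):
        loc, disp = components(q, gq)
        total += lam_mean[j] * loc + lam_disp[j] * disp
    return total


def update_weights(sums, previous):
    """Product-constrained weights; keep the previous ones if a sum is zero."""
    if min(sums) <= 0.0:
        return previous
    gm = math.prod(sums) ** (1.0 / len(sums))
    return [gm / s for s in sums]


def dynamic_clustering(data, K, max_iter=100):
    n, p = len(data), len(data[0])
    labels = [i % K for i in range(n)]           # deterministic start (assumption)
    lam_mean, lam_disp = [1.0] * p, [1.0] * p
    protos = [None] * K
    history = []
    for _ in range(max_iter):
        # Step 1: prototypes are the mean quantile functions of each cluster.
        for k in range(K):
            members = [y for y, lab in zip(data, labels) if lab == k]
            if members:                           # an empty cluster keeps its prototype
                protos[k] = [[mean(level) for level in zip(*(y[j] for y in members))]
                             for j in range(p)]
        # Step 2: weights from the within sums of each component.
        s_mean, s_disp = [0.0] * p, [0.0] * p
        for y, lab in zip(data, labels):
            for j in range(p):
                loc, disp = components(y[j], protos[lab][j])
                s_mean[j] += loc
                s_disp[j] += disp
        lam_mean = update_weights(s_mean, lam_mean)
        lam_disp = update_weights(s_disp, lam_disp)
        # Step 3: allocation to the closest prototype (ties go to the smallest k).
        new_labels = [min(range(K), key=lambda k: distance(y, protos[k], lam_mean, lam_disp))
                      for y in data]
        history.append(sum(distance(y, protos[lab], lam_mean, lam_disp)
                           for y, lab in zip(data, new_labels)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, history


data = [
    [[0.0, 1.0, 2.0], [5.0, 5.5, 6.0]],
    [[0.5, 1.5, 2.5], [5.0, 6.0, 7.0]],
    [[10.0, 11.0, 12.0], [0.0, 2.0, 4.0]],
    [[10.5, 11.0, 12.5], [1.0, 2.0, 5.0]],
    [[0.0, 2.0, 3.0], [5.0, 6.0, 6.5]],
    [[9.0, 11.0, 13.0], [0.0, 3.0, 4.0]],
]
labels, history = dynamic_clustering(data, K=2)
```

The list `history` stores the value of the criterion at the end of each iteration, so its entries should never increase from one iteration to the next.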

3.2.1. Complexity of the algorithm 

Let us denote with $D$ the number of operations required for the computation of the Wasserstein
distance, with $Q$ the number of operations for computing the mean quantile function of a set of
distributions, and with $I$ the number of iterations of the algorithm. The representation step
requires approximately $O(np \times Q)$ operations. The weighting step requires approximately
$O(np \times D)$ operations; indeed, it is based on the computation of the inertia within the clusters.
The allocation step requires approximately $O(nKp \times D)$ operations. Then, the complete
computational cost is of order $O(nKp \times D \times I)$.
3.3. Algorithm schema

The steps of the Dynamic Clustering Algorithm with adaptive distances for histogram
data are described in Algorithm 1.



4. Tools for the interpretation of the partition 

After a clustering task, it is important to evaluate the results of the procedure, in order to
have information about the intra-class and the inter-class heterogeneity, the contribution of
each variable to the final partition, etc. In the framework of the standard Dynamic Cluster
Algorithm, some indices, based on the ratio between the intra-cluster and the total inertia, can be
used as tools for interpreting the clustering results (Celeux et al. [23]). Such indices have been
extended by De Carvalho et al. [24] to the case of dynamic clustering with adaptive and non-adaptive
Euclidean distances for interval-valued data. Here, we use the same class of indices for the
evaluation of the partition of histogram data.

In the following section, we define an extension of:

Total inertia. It is the total sum of squared (TSS) distances of all the elements of the set $E$ from the global prototype ($g_E$). TSS is the adequacy criterion;

Within inertia. It is the within sum of squared (WSS) distances of the elements of each cluster from the respective barycenter, summed over all the clusters of the partition. It corresponds to the criterion $\Delta(G, \Lambda, P)$ with $P = (C_1, \ldots, C_k, \ldots, C_K)$, $G = (g_1, \ldots, g_k, \ldots, g_K)$ and

$$\Lambda_{\bar{y}} = (\lambda^1_{\bar{y}}, \ldots, \lambda^p_{\bar{y}}), \qquad \Lambda_{Disp} = (\lambda^1_{Disp}, \ldots, \lambda^p_{Disp}) \tag{36}$$

according to the GC-AWD; as well as

$$\Lambda = \begin{bmatrix} \Lambda_{1,\bar{y}} & \Lambda_{1,Disp} \\ \vdots & \vdots \\ \Lambda_{k,\bar{y}} & \Lambda_{k,Disp} \\ \vdots & \vdots \\ \Lambda_{K,\bar{y}} & \Lambda_{K,Disp} \end{bmatrix} \tag{37}$$

according to the CDC-AWD;

Between inertia. It is the sum of squared (BSS) distances between the prototypes of the clusters and the general prototype. Using the adaptive Wasserstein distance, and according to Huygens' theorem of decomposition of inertia, we prove that the total sum of squares can be decomposed into two additive terms: the within sum of squares and the between sum of squares (TSS = WSS + BSS).
Algorithm 1 Dynamic clustering with adaptive distances
Require: $K$ number of clusters,
         $E$ a set of $n > K$ individuals described by $p \geq 1$ histogram variables

Initialization
    Set $t = 0$
    Randomly choose a partition $P^0 = (C^0_1, \ldots, C^0_K)$ of $E$ into $K$ clusters
    Set $\Lambda^0 = 1$.

repeat
    Step 1: Definition of the best prototypes (Representation step)
        Given $P^t$ and $\Lambda^t$, compute the set $G^{t+1}$ of $K$ prototypes according to the criterion $\Delta(G^{t+1}, \Lambda^t, P^t)$ in Eqn. [?].

    Step 2: Definition of the best distances (Weighting step)
        Given $P^t$ and $G^{t+1}$, compute the matrix $\Lambda^{t+1}$ according to Eqs. (26) and (27) for GC-AWD, and Eqs. [?] and [?] for CDC-AWD.

    Step 3: Definition of the best partition (Allocation step)
        Given $G^{t+1}$ and $\Lambda^{t+1}$, allocate each element of $E$ to the closest prototype according to Eqs. [?] or [?]:
        test ← 0
        $P^{t+1} ← P^t$
        for $i = 1$ to $n$ do
            find the cluster $C^{t+1}_m$ to which $i$ belongs
            find the winning cluster $C^{t+1}_k$ such that $k = \arg\min_{1 \leq h \leq K} d(y_i, g^{t+1}_h \mid \Lambda)$
            if $k \neq m$ then
                test ← 1
                $C^{t+1}_k ← C^{t+1}_k \cup \{i\}$
                $C^{t+1}_m ← C^{t+1}_m \setminus \{i\}$
            end if
        end for
    set $t = t + 1$
until test = 0
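The three alternating steps of Algorithm 1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: each histogram is assumed to be stored as its quantile function sampled at `m` equally spaced levels, so that the squared Wasserstein distance is approximated by the mean squared difference of the sampled quantiles. The function name `dynamic_clustering`, the `init_labels` argument, the `weights_fn` hook and the re-seeding of empty clusters are hypothetical choices of this sketch. Since the weighting equations are not reproduced here, the weights stay at 1 unless a `weights_fn` is supplied, which reduces the sketch to the standard (non-adaptive) algorithm.

```python
import numpy as np

def dynamic_clustering(X, K, init_labels=None, weights_fn=None, max_iter=100, seed=0):
    """Cluster n individuals described by p histogram variables.

    X has shape (n, p, m): each histogram is stored as its quantile function
    sampled at m equally spaced levels (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n, p, _ = X.shape
    # Initialization: random partition of E into K clusters, Lambda^0 = 1.
    if init_labels is None:
        labels = rng.permutation(np.arange(n) % K)
    else:
        labels = np.asarray(init_labels)
    lam = np.ones(p)
    for _ in range(max_iter):
        # Step 1 (representation): prototype = mean quantile function of the cluster.
        protos = []
        for k in range(K):
            members = X[labels == k]
            # Assumption of this sketch: an empty cluster is re-seeded at random.
            protos.append(members.mean(axis=0) if len(members) else X[rng.integers(n)])
        G = np.stack(protos)                                  # shape (K, p, m)
        # Step 2 (weighting): hypothetical hook; without it the weights stay at 1.
        if weights_fn is not None:
            lam = weights_fn(X, G, labels)
        # Step 3 (allocation): assign each individual to the closest prototype.
        d2 = ((X[:, None] - G[None]) ** 2).mean(axis=3)       # shape (n, K, p)
        new_labels = (d2 * lam).sum(axis=2).argmin(axis=1)
        if np.array_equal(new_labels, labels):                # test = 0: stop
            break
        labels = new_labels
    return labels, G, lam

# Toy data: two groups of 10 individuals, p = 2 variables, m = 21 quantile levels.
z = np.linspace(-2.0, 2.0, 21)
spread = np.linspace(0.5, 1.5, 10)
group0 = 0.0 + spread[:, None] * z
group1 = 20.0 + spread[:, None] * z
Q = np.concatenate([group0, group1])          # shape (20, 21)
X = np.stack([Q, Q], axis=1)                  # shape (20, 2, 21)
start = np.array([0] * 7 + [1] * 3 + [1] * 7 + [0] * 3)
labels, G, lam = dynamic_clustering(X, K=2, init_labels=start)
```

A `weights_fn` returning an array of shape `(p,)` would mimic globally adaptive weights, while one returning shape `(K, p)` would mimic cluster-wise weights; both broadcast against the `(n, K, p)` distance array.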
4.1. Total sum of squares (TSS)

Let $E$ be a set of $n$ real points $x_i$, each weighted by a positive real number $\pi_i$, and let $h$ be a positive number. Then:

$$\sum_{i=1}^{n} \sum_{j>i} \pi_i \pi_j (x_i - x_j)^2 = h \sum_{i=1}^{n} \pi_i (x_i - \bar{x})^2 \tag{38}$$

where

$$\bar{x} = \frac{\sum_{i=1}^{n} \pi_i x_i}{\sum_{i=1}^{n} \pi_i} \tag{39}$$

and $h = \sum_{j=1}^{n} \pi_j$. We define the total sum of squares as:

$$TSS = \sum_{i=1}^{n} \pi_i (x_i - \bar{x})^2. \tag{40}$$

If we have $p > 1$ descriptors, the total sum of squares is:

$$TSS = \sum_{i=1}^{n} \sum_{j=1}^{p} \pi_{ij} (x_{ij} - \bar{x}_j)^2. \tag{41}$$

Without loss of generality, given a set $E$ of $n$ individuals described by $p$ histogram descriptors and clustered into $K$ groups, our two adaptive schemes of the Wasserstein distance lead to two kinds of TSS and of global prototypes, which are consistent with Eqn. (22) and Eqn. [?].
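The identity in Eq. (38) can be checked numerically with a few lines of Python. The sketch below assumes that $h$ is the sum of the weights, which is how the garbled definition of $h$ is read here; the data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)              # real points x_i
pi = rng.uniform(0.5, 2.0, size=6)  # positive weights pi_i

x_bar = np.sum(pi * x) / np.sum(pi)  # weighted mean, Eq. (39)
h = np.sum(pi)                       # assumed reading of h: sum of the weights

# Left-hand side of Eq. (38): weighted sum over all pairs i < j.
lhs = sum(pi[i] * pi[j] * (x[i] - x[j]) ** 2
          for i in range(len(x)) for j in range(i + 1, len(x)))
# Right-hand side of Eq. (38): h times the weighted sum of squared deviations.
rhs = h * np.sum(pi * (x - x_bar) ** 2)
```

Both sides agree up to floating-point error, which is why the sum of squared deviations from the weighted mean can be used in place of the sum of pairwise squared differences.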

STANDARD According to Eq. (22), the general prototype $g_E = (g_{E1}, \ldots, g_{Ep})$ is a vector of histogram descriptions $g_{Ej}$ whose quantile function is:

$$F^{-1}_{g_{Ej}}(t) = n^{-1} \sum_{i=1}^{n} F^{-1}_{ij}(t) \qquad \forall t \in [0, 1] \tag{42}$$

i.e., it is the histogram whose quantiles are the averages of the quantiles of the elements of $E$ for the $j$-th variable, and the total sum of squares is

$$TSS = \sum_{j=1}^{p} \sum_{i \in E} d^2_W(y_{ij}, g_{Ej}). \tag{43}$$

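GC-AWD The general prototype is computed according to Eqs. (23) and (39). Thus, the general prototype $g_E = (g_{E1}, \ldots, g_{Ep})$ is a vector of histogram descriptions $g_{Ej}$ whose quantile function is:

$$F^{-1}_{g_{Ej}}(t) = n^{-1} \sum_{i=1}^{n} \bar{y}_{ij} + n^{-1} \sum_{i=1}^{n} F^{c\,-1}_{ij}(t) = \bar{y}_{g_{Ej}} + F^{c\,-1}_{g_{Ej}}(t) \tag{44}$$

where the superscript $c$ denotes centering with respect to the mean. The total sum of squares is

$$TSS_{GC\text{-}AWD} = \sum_{j=1}^{p} \sum_{i \in E} \left[ \lambda^j_{\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{Ej}})^2 + \lambda^j_{Disp}\, d^2_W(y^c_{ij}, g^c_{Ej}) \right]. \tag{45}$$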

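CDC-AWD The general prototype $g_E = (g_{E1}, \ldots, g_{Ep})$ is a vector of histogram descriptions $g_{Ej}$ whose quantile function is:

$$F^{-1}_{g_{Ej}}(t) = F^{c\,-1}_{g_{Ej}}(t) + \bar{y}_{g_{Ej}} \tag{46}$$

where

$$\bar{y}_{g_{Ej}} = \frac{\sum_{h=1}^{K} \lambda^j_{h,\bar{y}} \sum_{i \in C_h} \bar{y}_{ij}}{\sum_{h=1}^{K} n_h \lambda^j_{h,\bar{y}}} \quad \text{and} \quad F^{c\,-1}_{g_{Ej}}(t) = \frac{\sum_{h=1}^{K} \lambda^j_{h,Disp} \sum_{i \in C_h} F^{c\,-1}_{ij}(t)}{\sum_{h=1}^{K} n_h \lambda^j_{h,Disp}} \quad \forall t \in [0, 1]; \tag{47}$$

where $n_h$ is the number of elements belonging to the cluster $h$. In this case, the total sum of squares is

$$TSS_{CDC\text{-}AWD} = \sum_{j=1}^{p} \sum_{k=1}^{K} \sum_{i \in C_k} \left[ \lambda^j_{k,\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{Ej}})^2 + \lambda^j_{k,Disp}\, d^2_W(y^c_{ij}, g^c_{Ej}) \right]. \tag{48}$$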

4.2. Within sum of squares (WSS)

The within sum of squares corresponds to the criterion minimized by the algorithm. Therefore, we recall the same results obtained in Section [?] for the two proposed adaptive distances:

$$WSS = \Delta(G, \Lambda, P). \tag{49}$$

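4.3. Between sum of squares (BSS)

For the standard distance and for the two proposed adaptive distances, the between sums of squares are as follows, where $n_k$ is the number of elements in cluster $k$:

STANDARD

$$BSS = \sum_{j=1}^{p} \sum_{k=1}^{K} n_k\, d^2_W(g_{kj}, g_{Ej}); \tag{50}$$

GC-AWD

$$BSS_{GC\text{-}AWD} = \sum_{j=1}^{p} \sum_{k=1}^{K} n_k \left[ \lambda^j_{\bar{y}} (\bar{y}_{g_{kj}} - \bar{y}_{g_{Ej}})^2 + \lambda^j_{Disp}\, d^2_W(g^c_{kj}, g^c_{Ej}) \right]; \tag{51}$$

CDC-AWD

$$BSS_{CDC\text{-}AWD} = \sum_{j=1}^{p} \sum_{k=1}^{K} n_k \left[ \lambda^j_{k,\bar{y}} (\bar{y}_{g_{kj}} - \bar{y}_{g_{Ej}})^2 + \lambda^j_{k,Disp}\, d^2_W(g^c_{kj}, g^c_{Ej}) \right]. \tag{52}$$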
Based on the quantities TSS, WSS and BSS, we can evaluate the clustering procedure as in [24], where a quality of partition index (QPI) is considered and is defined as:

$$QPI = 1 - \frac{WSS}{TSS}. \tag{53}$$

Since the decomposition of inertia holds for the Wasserstein distance, the QPI index can be written as follows:

$$QPI = \frac{BSS}{TSS}. \tag{54}$$

Equation (53) is equal to Eq. (54) if and only if the total inertia can be decomposed into two components: the inter-cluster inertia (BSS) and the intra-cluster inertia (WSS). We showed in Eq. (6) that this is true for the Wasserstein distance. Given the additivity of TSS, WSS and BSS, it is possible to detail the QPI for each variable, for each cluster and for each component.
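The decomposition TSS = WSS + BSS and the resulting equivalence of the two QPI expressions can be illustrated numerically for the standard (non-adaptive) case. The sketch below is a hedged illustration, not the authors' code: it assumes a single histogram variable whose histograms are stored as quantile functions sampled at equally spaced levels, and the helper names `w2_squared` and `inertia_decomposition` are hypothetical.

```python
import numpy as np

def w2_squared(q1, q2):
    """Squared L2 Wasserstein distance between sampled quantile functions."""
    return np.mean((q1 - q2) ** 2, axis=-1)

def inertia_decomposition(Q, labels):
    """Q: (n, m) sampled quantile functions of one variable; returns (TSS, WSS, BSS)."""
    g_E = Q.mean(axis=0)                       # general prototype (mean quantile function)
    tss = w2_squared(Q, g_E).sum()
    wss, bss = 0.0, 0.0
    for k in np.unique(labels):
        members = Q[labels == k]
        g_k = members.mean(axis=0)             # cluster prototype
        wss += w2_squared(members, g_k).sum()
        bss += len(members) * w2_squared(g_k, g_E)
    return tss, wss, bss

# Synthetic example: 3 clusters of 5 histograms, 50 quantile levels each.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 5)
centers = np.array([0.0, 3.0, 6.0])[labels]
Q = np.sort(rng.normal(loc=centers[:, None], scale=1.0, size=(15, 50)), axis=1)

tss, wss, bss = inertia_decomposition(Q, labels)
qpi_53 = 1.0 - wss / tss   # form of Eq. (53)
qpi_54 = bss / tss         # form of Eq. (54)
```

Because the sampled quantile functions live in a Euclidean space, Huygens' decomposition holds exactly here, so `qpi_53` and `qpi_54` coincide up to rounding.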



5. Experimental results 

Clustering (an unsupervised learning task) is an exploratory method applied to a dataset for which, in general, no information about the class structure is available to the researcher. As noted by Meila [25], there are many competing criteria for comparing clusterings, with no clear best choice. In an experimental evaluation, the quality of clustering algorithms is often based on their performance according to a specific quality index. Experiments use either a limited number of real-world instances or synthetic data. While real-world data are crucial for testing the proposed algorithms, there is currently a lack of public repositories furnishing data described by histograms: histograms are built by summarizing raw data and, whereas large repositories of standard data exist, none is available for histogram data. Therefore, a test bed of synthetic, pre-classified data must be assembled by a generator.

In order to assess the quality of the proposed algorithms, we used synthetic data and histogram representations of real data. For the analysis of the synthetic data, we set up two Monte Carlo experiments that generated two hundred datasets of histogram data with a known cluster structure. We then evaluated the quality of each algorithm using one index for the agreement and one for the accuracy between the obtained partitions and the initial one. The real data analysis was performed using data from a set of meteorological stations in the People's Republic of China. In the following, we describe how the experiments were set up and which indices were used to assess the performance of the algorithms.

5.1. Synthetic dataset

In this section, we describe the performance of the two Dynamic Clustering methods based on adaptive squared Wasserstein distances when the data have a controlled structure of variability. In particular, we aim to show that GC-AWD performs best when the data present a different dispersion for each component of each histogram variable, and that CDC-AWD gives better results when the data present a different dispersion for each component of the histogram variables within each cluster.

We cannot compare the Wasserstein distance with other distances for distributions, because the decomposition in terms of mean and dispersion is not guaranteed for the other distances presented in the literature (see [7]).
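The decomposition of the squared Wasserstein distance into a mean component and a dispersion component, on which the comparison above relies, can be illustrated as follows. This is a sketch under the assumption that each distribution is represented by its quantile function sampled at equally spaced levels; the function name `w2_components` and the sample sizes are illustrative choices.

```python
import numpy as np

def w2_components(q1, q2):
    """Split the squared L2 Wasserstein distance into mean and dispersion parts."""
    mean_part = (q1.mean() - q2.mean()) ** 2
    # Dispersion part: distance between the centered quantile functions.
    disp_part = np.mean(((q1 - q1.mean()) - (q2 - q2.mean())) ** 2)
    return mean_part, disp_part

levels = (np.arange(100) + 0.5) / 100          # midpoints of 100 equal-width levels
rng = np.random.default_rng(0)
q1 = np.quantile(rng.normal(0.0, 1.0, 1000), levels)
q2 = np.quantile(rng.normal(2.0, 3.0, 1000), levels)

mean_part, disp_part = w2_components(q1, q2)
total = np.mean((q1 - q2) ** 2)                # full squared distance
```

The two parts add up to `total`, so each component can be weighted separately, which is what the adaptive distances exploit.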
To this end, we set up two Monte Carlo experiments. Each experiment consisted of 100 generations of 150 synthetic objects (3 clusters of 50 objects), described by two histogram variables. Each experiment was initialized by choosing, for each cluster and each variable, four parameters (mean, standard deviation, skewness and kurtosis), obtaining 3 (clusters) × 2 (variables) baseline sets of four parameters. Starting from the baseline parameters, we assigned a standard deviation to each parameter and repeated the following steps 100 times:

1. for each cluster and each variable, we generated 50 sets of parameters, adding a random error (consistent with the standard deviation of the parameter) to each baseline set of four parameters according to the class to which the object belongs, obtaining 3 (clusters) × 2 (variables) × 50 (objects) sets of four parameters;

2. for each object and each variable, we generated 1,000 random numbers using a Pearson parametric distribution (for further details see Johnson et al. [26, p. 15, Eqn. 12.33]) based on the sets of parameters;

3. for each object and each variable, we computed a histogram using the algorithm presented in [18], obtaining 3 (clusters) × 2 (variables) × 50 (objects) histograms;

4. each clustering method (the standard Dynamic Clustering and the four adaptive-distance-based ones) was randomly initialized 50 times, and we chose the best final clustering result (i.e., the one with the lowest criterion value);

5. for each best solution, we computed the Corrected Rand Index [27] to indicate the quality of the result with respect to the initial labels of the data.

To measure the quality of the results furnished by the dynamic clustering algorithm with the different adaptive distances, an external validity index can be adopted. Indeed, in this case the data were labeled on the basis of the generator functions used to set up the experiments. In this paper, we use the Corrected Rand (CR) index, defined by Hubert and Arabie [27] for comparing two partitions. CR takes its values in the interval [-1, 1], where the value 1 indicates perfect agreement between partitions, whereas values close to zero (or negative) correspond to cluster agreement found by chance. Further, we use the accuracy index, defined as the percentage of correctly classified objects.

At the end of each experiment, we report the main statistics for the CR index and for the accuracy.
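For reference, the Corrected Rand index can be computed from the contingency table of two partitions. The following is a minimal sketch of the usual Hubert and Arabie formula, not the code used in the experiments; the function name `corrected_rand` is an illustrative choice, and the sketch does not handle the degenerate case in which both partitions consist of a single cluster (the denominator is then zero).

```python
import numpy as np

def corrected_rand(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same objects."""
    _, a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)
    _, b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)
    # Contingency table: table[u, v] = number of objects in class u of A and class v of B.
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)

    def comb2(x):
        # Number of unordered pairs that can be drawn from x objects.
        return x * (x - 1) / 2.0

    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(a))   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

same = corrected_rand([0, 0, 1, 1], [1, 1, 0, 0])     # same partition, labels swapped
crossed = corrected_rand([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that cut across each other
```

The index is invariant to a relabeling of the clusters, which is why `same` equals 1 even though the label values differ.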

Experiment 1. Considering the variability of a histogram variable as a combination of the variability of the moments associated with each histogram description of each individual, in the first experiment we generated data in order to obtain two histogram variables that locally (for each cluster) present the same histogram variability (i.e., each histogram variable has the same variability in each cluster), while globally the two histogram variables present different variability. In order to obtain the datasets, we fixed the baseline parameter space by sampling from 3 (clusters) × 2 (variables) × 4 (parameters) Normal distributions, as described in Table 2.

In order to visualize the space of the parameters, Fig. 1 shows the ellipses for each cluster and for each parameter.

After the generation of 100 datasets according to the baseline parameters, we computed the CR index as well as the accuracy for each algorithm. The means and the standard deviations of the agreement indices are summarized in Table 3. The results of Experiment 1 show that the Dynamic Clustering Algorithms based on adaptive distances outperformed the classic Dynamic Clustering. In this case, GC-AWD and CDC-AWD had similar performances, but GC-AWD gave the best results in terms of the mean CR index.
Table 2: Baseline settings for Experiment 1: each pair gives, respectively, the mean and the standard deviation of the Normal distribution sampled to extract the corresponding parameter of a Pearson family distribution, which has in turn been sampled to extract the values summarized by the histogram data.

Variable 1: space of parameters
             Mean         St.dev       Skewness        Kurtosis
Cluster 1    (-4.8, 6)    (12, 1.2)    (-0.05, 0.1)    (3.10, 0.1)
Cluster 2    (-4.8, 6)    (9, 1.2)     (0.00, 0.1)     (3.00, 0.1)
Cluster 3    (10.0, 6)    (6, 1.2)     (0.10, 0.1)     (2.95, 0.1)

Variable 2: space of parameters
             Mean         St.dev       Skewness        Kurtosis
Cluster 1    (17, 12)     (6.0, 0.6)   (0.1, 0.1)      (2.95, 0.1)
Cluster 2    (-17, 12)    (4.6, 0.6)   (0.0, 0.1)      (3.00, 0.1)
Cluster 3    (0, 12)      (3.3, 0.6)   (-0.1, 0.1)     (3.10, 0.1)

Figure 1: Experiment 1: 95% ellipses for the bivariate distributions of the parameters of the distributions defined in Table 2.



Table 3: Mean of the best 100 CR indices and accuracies for Experiment 1

                            STANDARD           GC-AWD             CDC-AWD
Mean Best CR (std)          0.4979 (0.0097)    0.6596 (0.0072)    0.6292 (0.0094)
Mean Best Accuracy (std)    0.7867 (0.0410)    0.8667 (0.0340)    0.8667 (0.0340)
Experiment 2. The second experiment was based on the generation of histogram data in order to obtain two histogram variables where, for each histogram variable and for each cluster, there is a different variability, while globally the two variables present a similar variability. In order to obtain the 100 datasets, we fixed the baseline parameter space by sampling from 3 (clusters) × 2 (variables) × 4 (parameters) Normal distributions, as described in Table 4.

In order to visualize the space of the parameters, Fig. 2 shows the ellipses for each cluster and for each parameter.

After the generation of 100 datasets according to the baseline parameters, we computed the CR index as well as the accuracy for each algorithm. The means and the standard deviations of the agreement indices are summarized in Table 5.

Table 4: Baseline settings for Experiment 2: each pair gives, respectively, the mean and the standard deviation of the Normal distribution sampled to extract the corresponding parameter of a Pearson family distribution, which has in turn been sampled to extract the values summarized by the histogram data.

Variable 1: space of parameters
             Mean           St.dev        Skewness         Kurtosis
Cluster 1    (0.0, 0.8)     (3.6, 0.3)    (-0.04, 0.01)    (2.90, 0.03)
Cluster 2    (-0.5, 1.6)    (2.7, 0.2)    (0.03, 0.01)     (3.05, 0.03)
Cluster 3    (2.8, 2.4)     (1.8, 0.1)    (0.10, 0.01)     (3.20, 0.03)

Variable 2: space of parameters
             Mean           St.dev        Skewness         Kurtosis
Cluster 1    (0.0, 2.3)     (4.1, 0.1)    (0.10, 0.01)     (3.20, 0.03)
Cluster 2    (-3.0, 1.6)    (3.4, 0.2)    (0.03, 0.01)     (3.05, 0.03)
Cluster 3    (1.1, 0.8)     (2.8, 0.3)    (-0.03, 0.01)    (2.90, 0.03)

Figure 2: Experiment 2: 95% ellipses for the bivariate distributions of the parameters of the distributions defined in Table 4.



Table 5: Mean of the best 100 CR indices and accuracies for Experiment 2

                            STANDARD           GC-AWD             CDC-AWD
Mean Best CR (std)          0.4255 (0.0064)    0.4248 (0.0063)    0.5376 (0.0095)
Mean Best Accuracy (std)    0.6867 (0.0464)    0.6867 (0.0464)    0.7133 (0.0452)



The second experiment shows that the Dynamic Clustering Algorithm based on the CDC-AWD scheme outperformed the classic Dynamic Clustering. CDC-AWD gave the best results in terms of both the mean CR index and the accuracy, while GC-AWD did not outperform the STANDARD algorithm.

These results provide empirical evidence that each of the two schemes of adaptive distances gives better clustering results under the corresponding structure of dispersion described at the beginning of this section.

5.2. A real dataset: Climatic data from China 

In this section, we apply dynamic clustering based on adaptive distances to a dataset whose descriptors are the distributions of the mean monthly temperature, the pressure, the relative humidity, the wind speed and the total monthly precipitation recorded at 60 meteorological stations in the People's Republic of China¹ from 1840 to 1988. For the purposes of this paper, we considered the distributions of the variables for January (the coldest month) and July (the hottest month), so our initial data is a 60 × 10 matrix, where the generic $(i, j)$ cell contains the histogram of the values of the $j$-th variable for the $i$-th meteorological station, built by means of the algorithm proposed by Irpino and Romano [18]. Table 6 describes the variables and the main statistics related to the global barycenter for each variable ($\bar{y}_j$ and $s_j$ are, respectively, the mean and the standard deviation of the histogram barycenter, while $TSS_j$ is computed according to the formulation of the Total Sum of Squares proposed by Verde and Irpino [10] as a measure of variability of the histogram variable).

Table 6: Basic statistics of the histogram variables: $\bar{y}_j$ and $s_j$ are the mean and the standard deviation of the barycenter histogram for the $j$-th variable, while $TSS_j$ is the Total Sum of Squares for the $j$-th variable.

       Variable                                   $\bar{y}_j$    $s_j$    $TSS_j$
Y1     Mean Relative Humidity (percent) Jan          67.9      7.0      127.9
Y2     Mean Relative Humidity (percent) July         73.9      4.5      114.2
Y3     Mean Station Pressure (mb) Jan               968.3      3.6     5864.7
Y4     Mean Station Pressure (mb) July              951.1      3.0     5084.4
Y5     Mean Temperature (Cel. x 10) Jan             -12.1     17.3      114.8
Y6     Mean Temperature (Cel. x 10) July            252.3     10.5      113.5
Y7     Mean Wind Speed (m/s) Jan                      2.3      0.6        1.1
Y8     Mean Wind Speed (m/s) July                     2.3      0.5        0.6
Y9     Total Precipitation (mm) Jan                  18.2     14.3      519.6
Y10    Total Precipitation (mm) July                144.6     80.8      499.9

Table 7 shows the main characteristics of a subset of the observed stations. The mean, the standard deviation, the skewness and the kurtosis are reported for each histogram description. The last two measures correspond to the third and fourth standardized moments. Table 7 also shows the spatial coordinates of the stations (longitude, latitude and elevation in meters). This information is used only for the interpretation of the obtained clusters: the spatial coordinates do not play an active role in the analysis. In this case, we use clustering in an exploratory fashion. For the sake of brevity, we show the results of only one method. The method and the number of clusters were chosen according to the maximum value observed for the Calinski and Harabasz [28] index (CH). The CH index is a validity index that is generally used for determining the number of clusters, and can be viewed as a Pseudo-F index. If $K$ is the number of clusters of a partition of a set of $n$ individuals, $WSS(K)$ the Within Sum of Squares and $BSS(K)$ the Between Sum of Squares, the CH index is computed as follows:

$$CH(K) = \frac{BSS(K)/(K-1)}{WSS(K)/(n-K)}.$$
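The CH formula above translates directly into code. The sketch below is illustrative only: the function name `calinski_harabasz` is a hypothetical choice and the sums of squares passed to it are made-up values, not results from the paper.

```python
def calinski_harabasz(bss, wss, n, K):
    """CH (pseudo-F) index for a partition of n individuals into K clusters."""
    if not 1 < K < n:
        raise ValueError("K must satisfy 1 < K < n")
    return (bss / (K - 1)) / (wss / (n - K))

# Hypothetical sums of squares, chosen only to illustrate the call.
ch = calinski_harabasz(bss=90.0, wss=10.0, n=60, K=8)
```

In practice the index would be evaluated for each candidate number of clusters, and the partition with the largest value retained.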



¹ Dataset URL: http://dss.ucar.edu/datasets/ds578.5/
Table 7: Dataset from 60 climatic stations in China: main characteristics for each station.

                                          |  Y1                               |  Y2                               |  Y3
ID   St.Name    Long.    Lat.    Elev.    |  Mean     St.dev.  Skew.   Kurt.  |  Mean     St.dev.  Skew.   Kurt.  |  Mean        St.dev.  Skew.   Kurt.
1    Hailaer    119.75   49.22   612.80   |  781.77   59.34    1.26    5.78   |  708.21   54.58    -0.88   3.28   |  9,477.82    25.02    -0.72   3.77
2    NenJiang   125.23   49.17   242.20   |  740.16   48.88    0.31    3.24   |  779.10   38.95    -0.32   1.89   |  9,920.16    28.91    -0.13   2.42
3    BoKeTu     121.92   48.77   739.40   |  690.43   46.44    1.22    5.68   |  782.71   42.46    -0.51   2.32   |  9,318.44    39.43    -0.04   2.13
4    QiQiHaEr   123.92   47.38   145.90   |  688.03   67.16    0.11    3.08   |  727.01   57.91    -0.53   3.16   |  10,051.41   26.14    -0.10   2.33
...
59   ZhanJiang  110.40   21.22   25.30    |  783.32   64.65    -0.98   3.59   |  808.81   20.74    -0.06   2.90   |  10,166.05   20.78    -0.34   2.95
60   HaiKou     110.35   20.03   1410     |  851.33   41.81    -0.62   3.23   |  819.64   28.38    -0.48   3.97   |  10,175.49   26.03    0.66    4.59

                |  Y4                                  |  Y5                                |  Y6
ID   St.Name    |  Mean        St.dev.  Skew.   Kurt.  |  Mean      St.dev.  Skew.   Kurt.  |  Mean     St.dev.  Skew.   Kurt.
1    Hailaer    |  9,339.85    16.68    -0.39   3.67   |  -275.31   32.40    -0.86   3.72   |  200.71   13.60    0.96    4.47
2    NenJiang   |  9,756.11    18.44    0.50    2.66   |  -254.96   26.44    -0.11   2.42   |  206.27   9.76     0.44    2.70
3    BoKeTu     |  9,224.33    29.67    0.71    2.56   |  -216.13   26.50    -0.69   3.77   |  179.46   10.87    0.66    4.04
4    QiQiHaEr   |  9,861.43    15.68    0.08    1.95   |  -197.70   23.36    -0.49   2.64   |  227.18   10.68    0.06    2.69
...
58   NanNing    |  9,946.66    19.69    -1.41   4.35   |  129.52    18.88    -0.69   3.18   |  283.90   5.96     -0.68   4.29
59   ZhanJiang  |  10,012.91   13.96    0.00    3.34   |  157.52    16.68    -0.45   2.66   |  288.28   5.42     0.12    2.48
60   HaiKou     |  10,028.93   18.09    0.44    3.63   |  174.09    17.78    0.02    2.55   |  284.05   5.49     -0.27   3.11

                |  Y7                             |  Y8                             |  Y9                               |  Y10
ID   St.Name    |  Mean    St.dev.  Skew.   Kurt. |  Mean    St.dev.  Skew.   Kurt. |  Mean     St.dev.  Skew.   Kurt.  |  Mean       St.dev.    Skew.   Kurt.
1    Hailaer    |  18.80   7.60     0.77    3.47  |  27.84   6.51     1.25    6.66  |  34.79    24.92    1.27    4.55   |  886.70     424.09     0.43    3.63
2    NenJiang   |  15.99   6.78     1.25    3.67  |  26.97   7.37     0.64    2.72  |  31.67    22.05    1.19    4.34   |  1,397.45   683.74     0.73    4.42
3    BoKeTu     |  33.97   7.44     0.01    3.14  |  20.44   4.32     0.46    2.94  |  25.41    23.90    1.06    3.40   |  1,362.48   605.98     0.86    3.32
4    QiQiHaEr   |  28.32   6.89     0.36    2.52  |  29.71   5.09     0.03    2.64  |  17.49    17.74    1.17    4.52   |  1,321.97   753.20     0.77    3.42
...
58   NanNing    |  16.83   433      -0.04   2.57  |  20.37   4.04     -0.30   2.30  |  317.73   293.68   1.59    6.26   |  2,003.83   872.76     0.36    2.77
59   ZhanJiang  |  31.01   9.34     1.27    5.62  |  28.86   5.67     0.23    2.70  |  221.59   248.06   1.49    4.87   |  2,080.67   1,146.92   0.56    3.08
60   HaiKou     |  32.65   8.63     0.62    2.53  |  25.84   6.62     -0.15   2.14  |  212.79   185.36   0.77    2.45   |  1,899.15   1,029.41   1.46    6.26

In this case, we executed 100 initializations for each clustering algorithm, with the number of clusters of the partition ranging from 2 to 10. Among the three algorithms, the method showing a good behavior of the CH index was CDC-AWD. Indeed, as shown in Fig. 3, its CH index reaches a maximum for a partition into $K = 8$ clusters, while the other methods do not present an absolute maximum, suggesting a weak cluster structure in the data. For $K = 8$, CDC-AWD registered a high value of the QPI index (0.928), while the dynamic clustering based on non-adaptive distances (STANDARD) had a QPI equal to 0.873, and GC-AWD had a QPI equal to 0.825. Table 8 shows the final set of $\lambda$ weights associated with the components (the mean and the dispersion) of the distance in each cluster according to the CDC-AWD algorithm. The $\lambda$'s can be considered as normalization factors for the components of the variables; indeed, we can see high values for the weights associated with the variables $Y_7$ and $Y_8$ (the Wind Speed in January and in July), which have, in general, a lower TSS. Because we used the CDC-AWD algorithm, the weights take into account the within-cluster variability of the components of the variables in each cluster, so, in this case, we must read the weights cluster by cluster. Indeed, the constraint that the product of the weights equals 1, under which the weights are computed according to Eqs. (29) and (30), refers to each cluster, so the weights cannot generally be compared across different clusters. Figure 4 is the map of the 60 stations. Each station is marked with a symbol that represents the cluster to which it belongs. Note that the clusters represent geographical areas that are consistent with the climate zones of China.

Figure 3: CH indices for the three algorithms and for K = 2, ..., 10.

Figure 4: Map of the 60 stations clustered into K = 8 classes using CDC-AWD. Each station is marked with a symbol that represents the cluster to which it belongs.

Table 8: Weights generated by CDC-AWD on China data, K = 8; the $n_k$'s are the cardinalities of the obtained clusters.

In order to comment on the clustering results, and considering the additive properties of the WSS, the BSS and (as a consequence) the TSS with respect to the components of the variables, the variables and the clusters, we present different versions of the QPI in Table 9, as proposed by De Carvalho et al. [24]:

• The QPIs denoted by $BSS^j_{k,\bar{y}}/TSS^j_{k,\bar{y}}$ and $BSS^j_{k,Disp}/TSS^j_{k,Disp}$ for the components, and by $BSS^j_k/TSS^j_k$ for the variables in each cluster, allow us to measure the homogeneity of a cluster with respect to the components and the variables: the closer they are to 1, the more likely the cluster is to contain similar objects.

• The QPIs denoted by $BSS_{k,\bar{y}}/TSS_{k,\bar{y}}$ and $BSS_{k,Disp}/TSS_{k,Disp}$ for the components, and by $BSS_k/TSS_k$ for all the variables in each cluster, allow us to measure the homogeneity of a cluster with respect to all the components and all the variables: the closer they are to 1, the more likely the cluster is to contain similar objects for all the components or all the variables.

• The QPIs related to each component ($BSS^j_{\bar{y}}/TSS^j_{\bar{y}}$ and $BSS^j_{Disp}/TSS^j_{Disp}$) and to each variable ($BSS^j/TSS^j$) for all the clusters allow us to understand the contribution of each component or of each variable to the cluster separation (the closer the QPI is to 1, the more the clusters are separated).

• The QPIs $BSS_{\bar{y}}/TSS_{\bar{y}}$ and $BSS_{Disp}/TSS_{Disp}$, and the QPI related to all the variables ($BSS/TSS$), allow us to evaluate the global quality of the clustering results for each component and for all the variables.

Figure 5: The general prototype and the cluster prototypes for the variable $Y_4$: Mean Station Pressure (mb) in July.

Overall, the CDC-AWD algorithm reaches a $QPI = BSS/TSS = 0.928$ for a number of clusters $K = 8$. Specifically, the mean component of the variables allows higher homogeneity within clusters ($BSS_{\bar{y}}/TSS_{\bar{y}} = 0.932$), while the dispersion component has a medium effect ($BSS_{Disp}/TSS_{Disp} = 0.553$). Considering the second group of QPI indices from the results presented in Table 9, cluster 3 contains objects that are very similar. For the dispersion structure, cluster 2 is slightly more globally heterogeneous. Considering the effect of each component in clustering the current dataset, the two variables that allow better homogeneity in the clusters are $Y_3$ and $Y_4$ (Mean Station Pressure in January and in July), while the worst results are observed for $Y_7$ (Mean Wind Speed in January) and $Y_2$ (Mean Relative Humidity in July).

Finally, in Figs. 5 and 6, we show the prototypes of the obtained clusters for the best (in terms of QPI) and for the worst variables in partitioning the 60 stations into $K = 8$ clusters.
Table 9: QPIs resulting from CDC-AWD, which has been used for partitioning China's stations into K = 8 clusters. The QPIs are detailed for each component, for each variable and for each cluster.

Figure 6: The general prototype and the cluster prototypes for the variable $Y_7$: Mean Wind Speed (m/s) in January.

6. Conclusions 

In this paper, we presented two new algorithms for the Dynamic Clustering of histogram data based on two adaptive squared Wasserstein distances. The adaptive dynamic clustering algorithm locally optimizes an adequacy criterion that measures the fit between the clusters and their representatives (the barycenters) based on distances that change at each iteration. The first algorithm (GC-AWD) uses a globally adaptive squared Wasserstein distance for each of the two (mean and dispersion) components of the quantile functions of histograms. The second algorithm (CDC-AWD) uses a locally adaptive squared Wasserstein distance for each of the two (mean and dispersion) components of the quantile functions of histograms, which changes according to each cluster. The advantage of using such adaptive distances is the ability to identify clusters of different sizes and shapes, while the standard DCA (as in the k-means case) finds spherical clusters. Starting from an initial random partition of the objects, the adaptive DCA alternates three steps until convergence, when the adequacy criterion reaches a stationary value, which represents a local minimum of the within sum of squares. In the first two steps, the algorithms give the solution for the best prototype of each cluster as well as the solution for the best adaptive distance (locally for each cluster). In the last step, the algorithm gives the solution for the best partition. The convergence as well as the time complexity of the algorithms were addressed.

Two experimental evaluations of the proposed methods were presented. The first was performed using two baseline settings for the generation of two hundred synthetic datasets, in order to show the usefulness of each scheme of adaptive Wasserstein distance in identifying a starting class structure in the data. The second, using a real-world dataset, showed the use of such algorithms in an exploratory fashion. All the algorithms based on adaptive distances were compared with the standard dynamic clustering algorithm (i.e., the one based on a standard squared Wasserstein distance). The experiment conducted on artificial data showed that the adaptive algorithms outperformed the standard one in terms of accuracy in identifying the initial class structure of the data. The experiment on the real-world dataset also showed that the adaptive algorithms were able to reach a good quality of partition, which was generally higher than that of the standard non-adaptive algorithm.